Computational Biology and Chemistry — Latest Matching Preprints

1

Supervised learning of protein thermal stability using sequence mining and distribution statistics of network centrality

Sharma, A.; Bagler, G.; Bera, D.

2019-09-24 bioinformatics 10.1101/777177 medRxiv

Top 0.1%

15.2%

Motivation: It is expected that the difference in the thermal stability of mesophilic and thermophilic proteins arises, at least in part, from the differences in their molecular structures and amino acid compositions. Existing machine learning approaches for supervised classification of proteins rely on features derived from the structural networks and the amino acid sequences. However, the network features used leave out several important network centrality values, the statistic used is a simple average, and the sequence features used are hand-picked, leading to an accuracy of 90%. Results: We show that discriminating sub-sequences of the amino acid sequences can significantly improve classification accuracy compared to the existing approaches of counting amino acids, di-peptide or even tri-peptide bonds. We identify notions of network centrality, specifically ones that depend on the distances between C atoms, that appear to correlate better with thermal stability than the existing network features. We also show how to generate better statistics from the node- and edge-wise centrality values that more accurately capture the variations in their values for different types of proteins. These improved feature selection techniques make it possible to classify between thermophilic and mesophilic proteins with 96% accuracy and 99% area under ROC. Availability: The dataset and source code used are available at https://github.com/ankits0207/Protein_Classification_BIO699 Contact: dbera@iiitd.ac.in

2

Predicting condensate formation of protein and RNA under various environmental conditions

Chin, K. Y.; Ishida, S.; Terayama, K.

2023-06-05 bioinformatics 10.1101/2023.06.01.543215 medRxiv

Top 0.1%

10.7%

Motivation: Liquid-liquid phase separation (LLPS) by biomolecules plays a central role in various biological phenomena and has garnered significant attention. The behavior of LLPS is strongly influenced by the characteristics of the RNAs and environmental factors such as pH and temperature, as well as the properties of the proteins. Recently, several databases of biomolecules associated with LLPS have been established, and prediction models of LLPS-related phenomena have been explored, leveraging these databases. However, a prediction model that concurrently considers proteins, RNAs, and experimental conditions has not been developed due to the limited information available from individual experiments in public databases. Results: To address this challenge, we have built a new dataset called RNAPSEC, which treats each individual experiment as a data point. This dataset was assembled by manually collecting data from public literature. Utilizing RNAPSEC, we developed two distinct models that consider a protein, RNA, and experimental conditions. The first model can predict the LLPS behavior of a protein and RNA under specific conditions. The second model can predict the required conditions for a given protein and RNA to undergo LLPS. RNAPSEC and these prediction models are expected to accelerate our understanding of the roles of proteins, RNAs, and environmental factors in LLPS. Availability: The codes for the prediction models and RNAPSEC are available at https://github.com/ycu-iil/RNAPSEC. Contact: terayama@yokohama-cu.ac.jp

3

In silico characterization, homology modeling and structure-based functional annotation of Nile Tilapia (Oreochromis niloticus) Hsp70 and Hsc70 proteins

Dayrit, G. B.; Burigsay, P. F.; Vera Cruz, E. M.; Santos, M. D.

2023-12-27 bioinformatics 10.1101/2023.12.27.573401 medRxiv

Top 0.1%

10.0%

Background: The molecular chaperones known as heat shock proteins 70 (Hsp70) and heat shock cognate protein 70 (Hsc70) are vital for maintaining cellular integrity and controlling stress. Methodology: The On-Hsp70 and On-Hsc70 proteins from Nile tilapia (Oreochromis niloticus) have been thoroughly examined in this study using in silico analysis, homology modeling, and functional annotation. Homology modeling was carried out using the SWISS-MODEL program, and the proposed model was assessed for its high reliability through analyses including ProSA, Verify 3D, PROVE, ERRAT, and Ramachandran plot. Results: The essential features of the On-Hsp70 and On-Hsc70 proteins encompass amino acid lengths (640 and 645), molecular weights (70,233.48 and 70,773.17 Da), theoretical isoelectric points (pI = 5.63 and 5.28), and the overall counts of negatively and positively charged residues (95 and 86; 95 and 81). Furthermore, the instability index (II) values were 35.27 (On-Hsp70) and 38.85 (On-Hsc70). Similarly, the aliphatic index (AI) exhibited high values for both proteins, reaching 84.58 (On-Hsp70) and 82.85 (On-Hsc70). On-Hsp70 and On-Hsc70 were both shown to contain an MreB/Mbl domain. Discussion: The authors found that On-Hsp70 and On-Hsc70 share key characteristics, including an acidic nature, high stability, and conserved domains. Protein-protein interaction analysis identified the co-chaperone Stip1 as a primary functional partner. Comparative modeling yielded highly reliable 3D models, revealing structural similarity to known proteins and predicted binding sites. Furthermore, the primary subcellular localization of both proteins is the cytoplasm. Functional analysis predicted an AMP-PNP binding site for On-Hsp70 and an ATP binding site for On-Hsc70. Conclusion: The discoveries deepen our understanding of Hsc70 and Hsp70 in Nile tilapia, highlighting their importance in fish physiology and positioning them as crucial study topics moving forward. This study adds to our understanding of the actions of these proteins in cellular processes and stress responses, which could impact fish health and resilience.
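
The aliphatic index reported in the abstract above can be illustrated with a short sketch. The abstract does not say which tool or formula produced the values, so this assumes the commonly used definition by Ikai (mole percent of Ala, plus 2.9 times Val, plus 3.9 times the sum of Ile and Leu). The function name and the toy sequences are hypothetical and are not the On-Hsp70 or On-Hsc70 sequences.

```python
from collections import Counter


def aliphatic_index(sequence: str) -> float:
    """Aliphatic index under the Ikai definition (an assumption here):
    AI = X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)),
    where X(.) is the mole percentage of the residue in the sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    length = len(seq)

    def mole_percent(residue: str) -> float:
        # Percentage of positions in the sequence occupied by this residue.
        return 100.0 * counts[residue] / length

    return (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )


# Hypothetical toy sequences, for illustration only.
all_alanine = aliphatic_index("AAAA")
no_aliphatic = aliphatic_index("GGGG")
```

A sequence made only of alanine scores 100 under this definition, and a sequence with no Ala, Val, Ile, or Leu scores 0, which gives a rough sense of scale for the values in the low 80s quoted above.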

4

A Comprehensive Enumeration of the Human Proteostasis Network. 1. Components of Translation, Protein Folding, and Organelle-Specific Systems

The Proteostasis Consortium; Elsasser, S.; Finley, D.; Mockler, E.; Lima, L.; Finkbeiner, S.; Gestwicki, J. E.; Stoeger, T. E.; Cao, K.; Garza, D.; Kelly, J. W.; Collier, M.; Rainbolt, T. K.; Taguwa, S.; Chou, C.-C.; Aviner, R.; Barbosa, N.; Moralez-Polanco, F.; Masto, V. B.; Frydman, J.; Elia, L. P.; Morimoto, R. I.; Powers, E. T.

2022-09-02 bioinformatics 10.1101/2022.08.30.505920 medRxiv

Top 0.1%

9.9%

The condition of having a healthy, functional proteome is known as protein homeostasis, or proteostasis. Establishing and maintaining proteostasis is the province of the proteostasis network, approximately 2,500 genes that regulate protein synthesis, folding, localization, and degradation. The proteostasis network is a fundamental entity in biology with direct relevance to many diseases of protein conformation. However, it is not well defined or annotated, which hinders its functional characterization in health and disease. In this series of manuscripts, we aim to operationally define the human proteostasis network by providing a comprehensive, annotated list of its components. Here, we provide a curated list of 959 unique genes that comprise the protein synthesis machinery, chaperones, folding enzymes, systems for trafficking proteins into and out of organelles, and organelle-specific degradation systems. In subsequent manuscripts, we will delineate the human autophagy-lysosome pathway, the ubiquitin-proteasome system, and the proteostasis networks of model organisms.

5

Identifying potential therapeutic targets for T-cell acute lymphoblastic leukemia using malignant networks and topological analysis

Ramos, R. H.; Carels, N.; Scardini, R.; Franca, L. L.; Simao, A.; Ferreira, C. d. O. L.; Carneiro, F. R. G.

2025-10-09 cancer biology 10.1101/2025.10.08.681235 medRxiv

Top 0.1%

9.8%

T-cell acute lymphoblastic leukemia (T-ALL) is an aggressive and heterogeneous disease requiring new therapeutic targets. We identified overexpressed genes in T-ALL cell lines, built subnetworks using the IntAct interactome, and evaluated the topological role of each gene. An attack strategy, removing one gene at a time, was applied with eleven network measures and persistent homology. A positive control group of essential genes for T-ALL tumorigenesis was included. Clustering, largest connected component, and especially Betti 1 effectively distinguished control genes from others, showing that topological metrics are valuable for target identification. Notably, NPM1 emerged as a key gene for maintaining network integrity, promoting proliferation, and ensuring survival, highlighting its potential as a promising therapeutic target in T-ALL. Significance of the study: This study proposes an innovative approach to finding therapeutic targets for T-ALL based on oncogenic protein-protein interaction networks and the identification of the most effective analysis metrics following network attacks. These findings can be broadly applied to any neoplasm, seeking more personalized and efficient therapy.
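
The one-gene-at-a-time attack strategy in the abstract above can be sketched for a single measure, the largest connected component. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the adjacency-dictionary representation, and the toy network are all hypothetical, and the other measures the abstract mentions (clustering, Betti 1, and the rest) are not covered.

```python
from collections import deque


def largest_component_size(adj, removed=frozenset()):
    """Size of the largest connected component, ignoring nodes in `removed`."""
    seen = set()
    best = 0
    for start in adj:
        if start in removed or start in seen:
            continue
        # Breadth-first search over the component containing `start`.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adj[node]:
                if neighbour not in removed and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best


def single_node_attack(adj):
    """For each node, the drop in largest-component size when it alone is removed."""
    baseline = largest_component_size(adj)
    return {
        node: baseline - largest_component_size(adj, frozenset({node}))
        for node in adj
    }


# Hypothetical toy network (not real interactome data): "H" acts as a hub.
toy_network = {
    "H": {"A", "B", "C"},
    "A": {"H"},
    "B": {"H"},
    "C": {"H", "D"},
    "D": {"C"},
}
impact = single_node_attack(toy_network)
most_disruptive = max(impact, key=impact.get)
```

In the toy network, removing the hub fragments the graph the most, which is the kind of signal the abstract describes using to rank candidate genes.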

6

Comparative Transcriptomic Analysis of ATRA-Resistant and ATRA-Sensitive APL Cell Lines Identifies LncRNA Biomarkers Associated with Drug Resistance

Marimuthu, O.; Shinde, N.; Sella, R. N.

2026-01-30 cancer biology 10.64898/2026.01.27.702191 medRxiv

Top 0.1%

9.8%

Acute promyelocytic leukemia (APL) is a distinct subtype of acute myeloid leukemia characterized by the t(15;17) translocation, leading to the PML (Promyelocytic leukemia protein)-RARA (Retinoic Acid Receptor Alpha) fusion protein. Although the PML-RARA fusion is common, 20 more fusion events have also been reported in APL. All-trans retinoic acid (ATRA) is a standard drug for APL, leading to significant improvement in patient outcomes; nevertheless, a small fraction of patients still experience relapse, and some patients exhibit resistance to the drug. Long non-coding RNAs (lncRNAs) are recognized as promising biomarkers for cancer diagnosis, prognosis, and treatment response. In this study, we used transcriptomic data from ATRA-resistant (AP1060) and ATRA-sensitive (NB4) cell lines, both treated and untreated, retrieved from the NCBI Gene Expression Omnibus (GEO) database to perform transcriptomic analysis with bioinformatic tools. We utilized the LncRAnalyzer pipeline to predict the lncRNAs, followed by differential expression analysis using DESeq2. Weighted Gene Co-expression Network Analysis (WGCNA) was employed to construct lncRNA co-expression modules associated with ATRA resistance. BEDTools was used to identify cis-acting target genes of lncRNAs. LncRNA-miRNA sponging was identified with the miRanda algorithm. The identified miRNAs reveal their significant role in APL and other leukemia subtypes. The results of the study show that the identified lncRNAs from the miRNA-lncRNA network are promising biomarkers for ATRA resistance.

7

Discovery of Novel R-Selective Aminotransferase Motifs through Computational Screening

Runthala, A.; Satya Sri, P. S.; Nair, A. S.; Puttagunta, M. K.; Sekhar Rao, T. C.; Sreya, V.; Sowmya, G. R.; Reddy G, K.

2024-08-22 bioinformatics 10.1101/2024.08.21.608959 medRxiv

Top 0.1%

9.7%

Transaminases, enzymes facilitating amino group transfers, are divided into four subfamilies: D-alanine transaminase (DATA), L-selective branched-chain aminotransferase (BCAT), 4-amino-4-deoxychorismate lyase (ADCL), and R-selective aminotransferase (RATA). RATA enzymes are particularly valuable in biocatalysis for synthesizing chiral amines and resolving racemic mixtures, yet their identification in sequence databases is challenging due to the lack of robust motif-based screening methods. By constructing a transaminase sequence dataset and categorizing the sequences into subfamilies, we re-screened conserved motifs and explored novel ones. Phylogenetic clustering and structural localization of these motifs on AlphaFold-predicted protein models validated their importance. For the ADCL, BCAT, DATA, and RATA datasets, we discovered 5, 7, 10, and 2 novel motifs, respectively. Additionally, unique residue patterns were identified, underscoring their structural significance. This motif-based computational approach promises to unveil novel RATA enzymes for biocatalytic applications.

8

Debunking the "junk": Unraveling the role of lncRNA-miRNA-mRNA networks in fetal hemoglobin regulation

Rahaman, M.; Bhowmick, C.; Komanapalli, J.; Mukherjee, M.; Byram, P. K.; Shukla, P. C.; Dolai, T. K.; Chakravorty, N.

2021-10-14 bioinformatics 10.1101/2021.10.13.464339 medRxiv

Top 0.1%

8.2%

Fetal hemoglobin (HbF) induction is considered to be a promising therapeutic strategy to ameliorate the clinical severity of β-hemoglobin disorders, and has gained a significant amount of attention in recent times. Despite the enormous efforts towards the pharmacological intervention of HbF reactivation, progress has been stymied due to limited understanding of γ-globin gene regulation. In this study, we intended to investigate the implications of lncRNA-associated competing endogenous RNA (ceRNA) interactions in HbF regulation. Probe repurposing strategies for extraction of lncRNA signatures and subsequent in silico analysis on publicly available datasets (GSE13284, GSE71935 and GSE7874) enabled us to identify 46 differentially expressed lncRNAs (DElncRNAs). Further, an optimum set of 11 lncRNAs that could distinguish between high HbF and normal conditions was predicted from these DElncRNAs using supervised machine learning and a stepwise selection model. The candidate lncRNAs were then linked with differentially expressed miRNAs and mRNAs to identify lncRNA-miRNA-mRNA ceRNA networks. The network revealed that 2 lncRNAs (UCA1 and ZEB1-AS1) and 4 miRNAs (hsa-miR-19b-3p, hsa-miR-3646, hsa-miR-937 and hsa-miR-548j) sequentially mediate cross-talk among different signaling pathways, which provides novel insights into the lncRNA-mediated regulatory mechanisms and thus lays the foundation for future studies to identify lncRNA-mediated therapeutic targets for HbF reactivation.

9

Using Deep Learning with Different Architectures to Recognize RNA:DNA Triplex Structures from Histone Modification Features

Tsenum, J. L.

2025-11-24 bioinformatics 10.1101/2025.09.16.676231 medRxiv

Top 0.1%

8.0%

Long non-coding RNAs (lncRNAs) can perform their regulatory roles by forming triple helices through RNA-DNA interaction. Although this has been verified by a few in vivo and in vitro methods, in silico approaches that seek to predict the potential of lncRNAs and DNA sites to form a triplex structure are required. Triplexator has also predicted vast numbers of lncRNAs and DNA sites that have the potential to form a triplex structure. There is also emerging experimental evidence that the presence of epigenetic marks at DNA sites and lncRNAs can facilitate the formation of RNA:DNA triplex structures. There is, therefore, a huge demand for computational approaches such as deep learning that can make novel predictions about RNA:DNA triplex structure formation. In this study, we developed four (4) deep neural network models that can predict the potential of lncRNAs and DNA sites to form triple helices genome-wide, by taking histone modification marks as features. Our data were first passed through the Triplexator to screen out lncRNAs and DNA sites with low potential to form triple helices. We used different deep learning architectures to build our models, including two-layer convolutional neural networks (CNN) and multilayer perceptron (MLP). Our DNA2_CNN model performed best, with a mean AUC of 0.78 at a kernel size of 32 and a learning rate of 1e-3. Our deep neural network models revealed several novel lncRNAs and DNA sites, including HOTAIR, MEG3, PARTICLE, DACOR1, MIR100HG, FENDRR, ANRIL, TUG1, MALAT1, LINC00599, TINCR, NEAT1, roX2, DHFR, OTX2-AS1, Xist, SNHG16, ATXN8OS, BCYRN1, TERC, and Khps1, that have the potential to form triplex structures, thereby confirming previous experimental results and those of the Triplexator. The performance of our models also supports previous findings that histone modification marks can help in identifying lncRNAs and DNA regions that have the potential to form RNA:DNA triplex structures. In conclusion, we showed that different deep learning architectures can recognize lncRNAs and DNA that have the potential to form RNA:DNA triplex structures.

10

The PVT1, HULC, and HOTTIP expression changes due to treatment in Diffuse Large B-cell lymphoma

Shahsavari, M.; Arbabian, S.; Hosseini, F.; Razavi, M. R.

2024-08-07 cancer biology 10.1101/2024.08.05.606587 medRxiv

Top 0.1%

7.9%

Diffuse large B-cell lymphoma (DLBCL) is the most common histological subtype of non-Hodgkin lymphomas. It is an aggressive malignancy that displays great heterogeneity in morphology, genetics, biological behavior and treatment response owing to chromatin remodeling and epigenetics. Bioinformatic-based approaches were used to understand the possible signaling pathways of the three lncRNAs PVT1, HULC, and HOTTIP. Furthermore, their expression levels were quantitatively evaluated in 100 patients before and after the treatment. The results revealed that expression of PVT1, HULC, and HOTTIP was significantly upregulated by 7.39 ± 8.48-, 5.924 ± 7.536-, and 4.137 ± 5.863-fold, respectively, relative to normal cases. Post-treatment measurement of lncRNA expression indicated that PVT1 and HOTTIP were significantly downregulated. Interestingly, the expression levels of PVT1, HULC, and HOTTIP were significantly higher in DLBCL patients aged > 60 years than in those aged < 60 years. In addition, there was a significant positive correlation between HULC and HOTTIP expression. The analysis of overexpressed lncRNA-miRNA interaction indicated different deregulated miRNA targets, and the protein targets of upregulated lncRNAs are mainly associated with histone modification, DNA methylation/demethylation, and protein methyltransferase activity. Summary blurb: Expression of the lncRNAs PVT1, HULC, and HOTTIP is significantly upregulated before treatment and reduces to normal levels after treatment. It can be used as a diagnostic marker or prognostic means, especially in relapsed/refractory DLBCL.

11

Physical-Chemical Features Selection Reveals That Differences in Dipeptide Compositions Correlate Most with Protein-Protein Interactions

Teimouri, H.; Medvedeva, A.; Kolomeisky, A. B.

2024-03-01 bioinformatics 10.1101/2024.02.27.582345 medRxiv

Top 0.1%

7.8%

The ability to accurately predict protein-protein interactions is critically important for our understanding of major cellular processes. However, current experimental and computational approaches for identifying them are technically very challenging and still have limited success. We propose a new computational method for predicting protein-protein interactions using only primary sequence information. It utilizes a concept of physical-chemical similarity to determine which interactions will most probably occur. In our approach, the physical-chemical features of proteins are extracted using bioinformatics tools for different organisms, and then they are utilized in a machine-learning method to identify successful protein-protein interactions via correlation analysis. It is found that the property that correlates most with protein-protein interactions for all studied organisms is dipeptide amino acid composition. The analysis is specifically applied to the bacterial two-component system that includes histidine kinase and transcriptional response regulators. Our theoretical approach provides a simple and robust method for quantifying the important details of complex mechanisms of biological processes.
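
The dipeptide composition feature named in the abstract above can be sketched as a 400-dimensional frequency vector over ordered amino acid pairs. The abstract does not specify how the difference between two proteins is measured, so the elementwise absolute difference used here is an assumption, as are the function names and the toy sequences.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
DIPEPTIDE_SET = set(DIPEPTIDES)


def dipeptide_composition(sequence):
    """Frequencies of the 400 dipeptides over overlapping windows of length 2.

    Windows containing non-standard residues are skipped."""
    seq = sequence.upper()
    windows = (seq[i:i + 2] for i in range(len(seq) - 1))
    counts = Counter(w for w in windows if w in DIPEPTIDE_SET)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def composition_difference(seq_a, seq_b):
    """Elementwise absolute difference between two composition vectors.

    This particular difference measure is an illustrative assumption."""
    comp_a = dipeptide_composition(seq_a)
    comp_b = dipeptide_composition(seq_b)
    return [abs(a - b) for a, b in zip(comp_a, comp_b)]


# Hypothetical toy sequences, for illustration only.
vector = dipeptide_composition("ACAC")
difference = composition_difference("ACAC", "ACAC")
```

A vector of this kind could serve as the input feature for a pair of candidate interaction partners; how the features are then combined with a machine-learning model is not described in enough detail in the abstract to sketch here.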

12

A map of Non-translated RNA (nt-RNA) junctions in cancer genomes: a database resource of unproductive splicing

Huang, D.; Kwan, T.-K.; Ma, S.-L.; Tang, N. L.-s.

2025-06-19 cancer biology 10.1101/2025.06.15.659434 medRxiv

Top 0.1%

7.8%

Background: Non-translated transcripts (nt-RNAs) with frame-shifts or premature termination codons resulting from alternative splicing events (ASE) have recently been found to be unexpectedly abundant in transcriptomes of cancer tissue. However, their full genomic spectrum has not yet been fully elucidated. This study comprehensively characterised the expression of signature junctions of these nt-RNA (termed "toxic junctions" here) of both known and novel nt-RNA across multiple cancer types and investigated their potential as biomarkers. Methods: RNA-seq data of ~6,000 samples, including the tumor and normal samples for 13 cancer types, were retrieved from The Cancer Genome Atlas database (TCGA) together with data from the Cancer Cell Line Encyclopedia (CCLE) project. Due to the difficulty in quantifying the entire transcript isoform of nt-RNA, we pioneered an algorithm to focus exclusively on the expression of junctional reads, which also circumvented the limitation of non-directional RNA-seq of TCGA data. We showed that the majority of nt-RNA is associated with at least one toxic junction. We built a comprehensive catalogue of known nt-RNA toxic junctions from genome databases. Novel toxic junctions were also identified by a new junction-focused algorithm from the higher quality discovery subsets of TCGA data. Splicing in Ratio (SiR) was used to quantify ASE leading to nt-RNA, enabling: (1) differential expression analysis between cancer and normal tissue and across cancer types; (2) identification of different profiles of nt-RNA abundance and of various factors that may cause differential nt-RNA abundance and SiR results; (3) identification of specific nt-RNA and toxic junctions that were expressed in various cancer (and/or normal tissue) types; and (4) assessment of nt-RNA and their toxic junction expression as biomarkers or prognosis indicators.
Results: We profiled the expressed known nt-RNA (toxic) junctions of known transcripts and discovered ~22,000 novel toxic junctions out of ~250,000 novel junctions found in the transcriptome data. The expression of nt-RNA was as high as 10% of all transcripts of the corresponding gene in cancer transcriptomes. Interestingly, some signature toxic junctions of nt-RNA are expressed in even higher quantities, e.g. up to 50% or more, which is reminiscent of a heterozygous mutation. We identified distinct patterns between cancer and normal samples, including examples of nt-RNA expressing toxic junctions exclusively in normal or tumor samples. Clinically relevant examples included ANXA6 in breast cancer, where the nt-RNA isoform showed significantly higher expression in tumors (p=1.8e-15). In kidney renal clear cell carcinoma (KIRC), a significant isoform switch of ESYT2 based on the RNA-seq data was confirmed. The Kaplan-Meier survival curves showed that samples with the higher expression ratio of ESYT2-L are associated with better survival (p=2.0e-06). Unsupervised clustering showed that SiR results of 150 toxic signatures defined 4 subgroups of patients with different prognosis. Through principal component analysis (PCA), PC1 and PC2 can be used as independent prognosis biomarkers. nt-RNA accounting for these PCs included splicing factors SRSF3 and CLK1, where CLK1 phosphorylates SRSF3 to promote exon 4 inclusion in both genes. Conclusions: In summary, the expression profiles of all known and novel toxic junctions were explored using pan-cancer RNA-seq data. A dual 10% rule emerged from this study: ~10% of novel junctions were toxic junctions associated with nt-RNA, and up to 10% of RNA transcripts inside a cell were also nt-RNA. The SiR metric enables accurate quantification of unproductive splicing and identification of cancer biomarkers. Our findings reveal that unproductive splicing represents functionally important post-transcriptional regulation in cancer.
These expression profiles allow researchers to study the expression of nt-RNA signature junctions or novel signature junctions in or near the genes they are interested in, which could provide a new direction for their research. The SRSF3-CLK1 regulatory mechanism provides insights into splicing dysregulation. Our comprehensive toxic junction catalogue serves as a valuable resource, suggesting that targeting unproductive splicing pathways may offer novel therapeutic strategies for cancer treatment. Data availability: The catalogue is available on GitHub and the UCSC browser: https://github.com/danhuang0909/nt_database for a GitHub overview, https://genome.ucsc.edu/s/dandan_0909/hg38_all_new_nr for genome browsing of all novel (unannotated) toxic junctions, and https://genome.ucsc.edu/s/dandan_0909/hg38_5_26 for toxic junctions in known (annotated) nt-RNA.

13

AntiCP3: Prediction of Anticancer Proteins Using Evolutionary Information from Protein Language Models

Gupta, A.; Chauhan, M.; Tomer, R.; Raghava, G. P. S.

2025-05-03 bioinformatics 10.1101/2025.04.29.651196 medRxiv

Top 0.1%

7.8%

A number of computational methods have been developed in the past for predicting anticancer peptides, including AntiCP and AntiCP2 from our group. While these tools have been widely used by the scientific community, they are not suitable for predicting anticancer proteins. In this study, we present AntiCP3, the first dedicated method for the prediction of anticancer proteins. All models were trained using five-fold cross-validation and evaluated on an independent dataset not used during training. Our initial analysis revealed distinct compositional differences between anticancer peptides and proteins, justifying the need for a separate prediction framework. We first implemented similarity-based approaches, which yielded moderate performance. Subsequently, we developed machine learning and deep learning models using conventional protein features, achieving a maximum AUC of 0.72. The performance improved to an AUC of 0.79 with the incorporation of evolutionary information through PSSM profiles. Further enhancement was observed when embeddings from a fine-tuned protein language model, ESM-t33, were used, leading to a best AUC of 0.90. Finally, a hybrid approach combining BLAST with our machine learning model achieved an AUC of 0.91. To facilitate the scientific community, we have implemented AntiCP3 as both a web server and standalone software for the prediction of anticancer proteins (https://webs.iiitd.edu.in/raghava/anticp3/). We have also deployed our model on Hugging Face at https://huggingface.co/raghavagps-group/anticp3. Highlights: Existing methods have been developed for predicting anticancer peptides. AntiCP3 is specifically optimised for predicting anticancer proteins. PSI-BLAST was used to obtain evolutionary information in the form of PSSM profiles. Hybrid methods were developed using alignment-free and alignment-based approaches. A web server and standalone tool have been developed to assist the community.

14

Development of an EMT-related exosomal miRNA signature that can predict prognosis in hepatocellular carcinoma

Missaghimamaghani, O.; Nehri, L. N.; Bakhshi, S.; Karaosmanoglu, O.; Sivas, H.; Acar, A. C.; Banerjee, S.

2025-09-11 cancer biology 10.1101/2025.09.06.674651 medRxiv

Top 0.1%

7.7%

Chemoresistance and epithelial-mesenchymal transition (EMT) are associated with failure of cancer chemotherapy and poor survival of patients. We have previously shown that chemoresistance and stemness in hepatocellular carcinoma (HCC) cells were accompanied by the development of partial EMT (p-EMT) and identified a number of EMT-associated proteins that are released from exosomes. In this study, we aimed to identify and classify the differentially expressed (DE) exosomal miRNAs from chemoresistant HuH7 cells undergoing p-EMT. Out of the fifty-four miRNAs that were enriched in the exosomes from these cells compared to controls, thirteen were identified in the exosomes isolated from the serum of HCC patients. These miRNAs targeted genes that were associated with cell-cell junctions, extracellular matrix, cytoskeleton, transcription and signal transduction. Univariate Cox regression analysis indicated that 11/13 miRNAs were associated with either favorable (n=4) or worse (n=7) prognosis. A machine learning algorithm indicated that seven miRNAs (miR 215-5p, miR 340-5p, miR 210-3p, miR 19a-3p, miR 19b-3p, miR 1266-5p and miR 25-3p) could predict worse prognosis in multiple datasets with 64-68% accuracy. A Bayesian inference network analysis with the thirteen miRNAs and their key target proteins, along with EMT and survival as the nodes, indicated that the common denominator was transcription, suggesting that the exosomal miRNAs released from cells undergoing p-EMT can mediate phenotypic changes in cells through transcriptional regulation.

15

Comparative study of manganese catalase monomers, interfaces and cage architecture within the ferritin superfamily

Amrita, A.; Chakraborti, S.; Dey, S.

2025-06-20 bioinformatics 10.1101/2025.06.17.660058 medRxiv

Top 0.1%

7.4%

Manganese catalase protein is an example of a protein cage within the ferritin superfamily that focuses on enzymatic catalysis rather than on the storage for which other ferritin proteins are known. Formed of 6 homomeric chains, it shows large subunit-subunit sidewise contacts and special interfacial interactions ensuring cage generation with only 6 subunits. We aim to explore manganese catalase at the monomer, subunit-subunit and cage level. As compared to other ferritin subtypes, we found manganese catalase monomers to have a greater fraction of β-turns at the secondary structure level and thermophilic manganese catalase monomers to have a greater non-polar to polar residue ratio. At the interface level, the placement of subunits in the manganese catalase cage with sidewise and angular interfaces was found to be distinctly different from the parallel and perpendicular interfaces present in other ferritin subtypes. Our study highlights the contribution of sidewise interfaces and terminal extensions in making the cage architecture possible in 6-mer manganese catalases. We also studied the quaternary structure of the manganese catalase cage with respect to classical ferritin (C-ferritin) and found manganese catalase to have smaller cavity volume and cavity surface area than C-ferritin but a higher cavity surface to volume ratio. This observation, along with the smaller distance between substrate entry point and active site as compared to C-ferritin, highlights the structural distinction between catalytic enzymes like manganese catalase and storage proteins like ferritins. Thus, the study gives structural insight into the manganese catalase protein cage with focus on its ability to form a cage architecture and show efficient catalytic activity.

16

RUDEUS, a machine learning classification system to study DNA-Binding proteins

Medina-Ortiz, D.; Cabas-Mora, G.; Moya-Barria, I.; Soto-Garcia, N.; Uribe-Paredes, R.

2024-02-21 bioinformatics 10.1101/2024.02.19.580825 medRxiv

Top 0.1%

7.3%

DNA-binding proteins are essential in different biological processes, including DNA replication, transcription, packaging, and chromatin remodelling. Exploring their characteristics and functions has become relevant in diverse scientific domains. Computational biology and bioinformatics have assisted in studying DNA-binding proteins, complementing traditional molecular biology methods. While recent advances in machine learning have enabled the integration of predictive systems with bioinformatic approaches, generalizable pipelines are still needed for identifying unknown proteins as DNA-binding and assessing the specific type of DNA strand they recognize. In this work, we introduce RUDEUS, a Python library featuring hierarchical classification models designed to identify DNA-binding proteins and assess the specific interaction type, whether single-stranded or double-stranded. RUDEUS has a versatile pipeline capable of training predictive models, synergizing protein language models with supervised learning algorithms, and integrating Bayesian optimization strategies. The trained models have high performance, achieving a precision rate of 95% for DNA-binding identification and 89% for discerning between single-stranded and double-stranded interactions. RUDEUS includes an exploration tool for evaluating unknown protein sequences, annotating them as DNA-binding, and determining the type of DNA strand they recognize. Moreover, a structural bioinformatic pipeline has been integrated into RUDEUS for validating the identified DNA strand through DNA-protein molecular docking. These comprehensive strategies and straightforward implementation demonstrate comparable performance to high-end models and enhance usability for integration into protein engineering pipelines.

17

Structural characterization of a novel luciferase-like monooxygenase from Pseudomonas meliae: an in silico approach

Rayhan, M.; Siddiquee, M. F. R.; Shahriar, A.; Ahmed, H.; Mahmud, A. R.; Alam, M. S.; Uddin, M. R.; Acharjee, M.; Shimu, M. S. S.; Shamsir, M. S.; Emran, T. B.

2023-03-29 bioinformatics 10.1101/2023.03.27.534437 medRxiv

Top 0.1%

7.3%


Background: Luciferase is a well-known oxidative enzyme that produces bioluminescence. Pseudomonas meliae is a plant pathogen that causes wood rot on nectarine and peach and possesses a luciferase-like monooxygenase. After activation, it produces bioluminescence, and the pathogen's bioluminescence is a visual indicator of diseased plants. Methods: The present study aims to model and characterize the luciferase-like monooxygenase protein in P. meliae for its similarity to well-established luciferase. In this study, the luciferase-like monooxygenase from P. meliae, which infects chinaberry plants, was first modeled and then compared with known luciferases. The similarities between the uncharacterized luciferase from P. meliae and the template from Geobacillus thermodenitrificans were also analyzed to identify what is novel about the P. meliae protein. Results: The results suggest that the absence of bioluminescence in P. meliae could be due to evolutionary mutations at positions 138 and 311. The active site remains identical except for two amino acids: P. meliae has Tyr138 instead of His138 and Leu311 instead of His311. Therefore, the P. meliae enzyme has potential future applications, and mutating residues 138 and 311 may restore the luciferase light-emitting ability. Conclusions: This study will help further improve, activate, and repurpose the luciferase from P. meliae as a reporter for gene expression.

18

Machine Learning and Network Analysis to Predict Hypothetical Protein Functions of Aeromonas hydrophila

Pirim, H.; Rahman, Z.; Saei, S.; Gyawali, S.; Marufuzzaman, M.; Tajik, N.; Tekedar, H. C.

2025-07-29 bioinformatics Community evaluation 10.1101/2025.07.22.666223 medRxiv

Top 0.1%

7.3%


Aeromonas hydrophila, an antibiotic-resistant Gram-negative bacterium, is a major fish pathogen. Moreover, A. hydrophila is considered to cause 13% of gastroenteritis cases in the United States. Therefore, it is important to identify groups of proteins that contribute to antibiotic resistance and to mortality in aquaculture. We train machine learning models on existing A. hydrophila genomes to predict the functions of 83 carefully filtered hypothetical proteins. Network analysis is conducted to cluster these proteins based on their similarities. Both the ML and the network analysis shed light on possible roles of these proteins in vaccine candidacy and fish mortality.

19

Unraveling the Impact of Gene Length on Kinetic Parameters: Implications in Drug Target selection

Choudhuri, S.; Ghosh, B.

2024-09-02 bioinformatics 10.1101/2024.08.31.610572 medRxiv

Top 0.1%

7.3%


Gene expression is a multifaceted process crucial to understanding molecular biology and pharmacology. Our research focuses on elucidating the intricate relationship between gene length and kinetic parameters, such as Si, Kon, Koff, and SKoff, which significantly influence the mean expression levels of genes. Using a two-state stochastic gene expression model implemented in Python, we analyzed single-cell transcriptomics data to predict kinetic parameters for each gene. We classified genes into short and long categories, revealing distinct patterns in the relationship between gene length and these parameters. Our results indicate that burst size plays a critical role in mean expression, highlighting its importance for identifying gene targets that require lower drug doses for therapeutic effects.
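
To make the two-state model mentioned in this abstract concrete, here is a minimal sketch of the standard telegraph model of gene expression. It is not the authors' implementation. The parameter names `k_on`, `k_off`, `s` (synthesis rate while the gene is ON), and `d` (mRNA degradation rate) are assumptions, as is the reading of burst size as `s / k_off`; the abstract does not define Si or SKoff. The sketch gives the textbook steady-state mean, (s/d) · k_on/(k_on + k_off), and a small Gillespie simulation to check it.

```python
import random


def telegraph_mean(k_on: float, k_off: float, s: float, d: float = 1.0) -> float:
    """Steady-state mean mRNA count: (s/d) times the fraction of time the gene is ON."""
    return (s / d) * k_on / (k_on + k_off)


def burst_size(s: float, k_off: float) -> float:
    """Mean number of transcripts made per ON period (assumed definition)."""
    return s / k_off


def simulate_telegraph(k_on, k_off, s, d=1.0, t_end=2000.0, seed=0):
    """Gillespie simulation of the two-state model; returns the time-averaged mRNA count."""
    rng = random.Random(seed)
    t, gene_on, m, area = 0.0, False, 0, 0.0
    while t < t_end:
        switch_rate = k_off if gene_on else k_on  # promoter ON<->OFF switching
        make_rate = s if gene_on else 0.0         # transcription only while ON
        decay_rate = d * m                        # first-order mRNA degradation
        total = switch_rate + make_rate + decay_rate
        dt = min(rng.expovariate(total), t_end - t)
        area += m * dt                            # accumulate m over the waiting time
        t += dt
        if t >= t_end:
            break
        r = rng.random() * total                  # choose which reaction fires
        if r < switch_rate:
            gene_on = not gene_on
        elif r < switch_rate + make_rate:
            m += 1
        elif m > 0:
            m -= 1
    return area / t_end


if __name__ == "__main__":
    params = dict(k_on=1.0, k_off=3.0, s=20.0, d=1.0)
    print("analytic mean :", telegraph_mean(**params))
    print("simulated mean:", simulate_telegraph(**params))
    print("burst size    :", burst_size(params["s"], params["k_off"]))
```

Under these definitions the mean can be rewritten as burst size times burst frequency divided by the degradation rate (with a correction factor when `k_on` is not small relative to `k_off`), which is one way to see why burst size would feature prominently in mean expression.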

20

Homology Modelling, In Silico Prediction And Characterization Of Cytochrome c oxidase In Cyprinus carpio And Tubifex tubifex And Molecular Docking Studies Between The Modelled Protein And Three Commonly Used Surfactants Sodium Dodecyl Sulphate, Cetylpyridinium Chloride And Sodium Laureth Sulphate

Bhattacharya, R.; Daoud, I.; Chatterjee, A.; Chatterjee, S.; Saha, N. C.

2021-07-10 bioinformatics 10.1101/2021.07.09.451643 medRxiv

Top 0.1%

7.1%


This work covers the homology modeling, in silico prediction, and characterisation of Cytochrome c oxidase from Cyprinus carpio and Tubifex tubifex, as well as molecular docking experiments between the modelled protein and three frequently used surfactants. Using the crystal structure of bovine heart Cytochrome c oxidase as the template, homology modeling of Cytochrome c oxidase (Subunit 2) of Cyprinus carpio (Accession # P24985) and Cytochrome c oxidase (Subunit 1) of Tubifex tubifex (Accession # Q7YAA6) was conducted. The model structure was improved further with 3Drefine, and the final 3D structure was verified with PROCHECK and ERRAT. The physicochemical and stereochemical parameters of the modelled protein were evaluated using tools such as ExPASy ProtParam, Hydropathy Analysis, and EMBOSS pepwheel. The resulting model was then docked with the toxic ligands Sodium dodecyl sulfate (SDS), Cetylpyridinium chloride (CPC), and Sodium laureth sulfate (SLES), whose 3D structures were obtained from the UniProt database. According to our findings, CPC interacted best with Cytochrome c oxidase subunit 2 of Cyprinus carpio and Cytochrome c oxidase subunit 1 of Tubifex tubifex. Furthermore, for all surfactants, hydrophobic interactions with the active-site amino acid residues of the modelled protein were more common than hydrogen bonds and salt bridges. Molecular simulation studies showed that the surfactants alter the structural flexibility of the predicted proteins. Hence it may be inferred that the surfactants might alter the structure and dynamics of Cytochrome c oxidase of both worm and fish.